Tip: This is the Financial Contributions to Presidential Campaigns in WA data set.
## 'data.frame': 292317 obs. of 9 variables:
## $ cand_nm : Factor w/ 24 levels "Bush, Jeb","Carson, Benjamin S.",..: 4 19 19 4 4 22 19 4 19 19 ...
## $ contbr_nm : Factor w/ 49817 levels "'CALL, ALLAN",..: 11139 23477 26330 15549 13027 34466 26698 20215 24250 24250 ...
## $ contbr_city : Factor w/ 719 levels "","1644 WINDEMERE DR.",..: 495 560 560 301 560 630 321 576 243 243 ...
## $ contbr_zip : Factor w/ 53472 levels "","00159","00160",..: 38469 12154 22416 7522 16517 39786 36566 24929 35810 35810 ...
## $ contbr_employer : Factor w/ 13373 levels "","''I LIKE COMICS''",..: 1 7617 10388 1 4979 5404 7761 10063 7761 7761 ...
## $ contbr_occupation: Factor w/ 8260 levels "","-"," CERTIFIED REGISTERED NURSE ANESTHETIS",..: 2111 4693 7485 6164 2160 3514 4693 5707 4693 4693 ...
## $ contb_receipt_amt: num 25 27 50 55 18.9 ...
## $ contb_receipt_dt : Factor w/ 671 levels "01-APR-15","01-APR-16",..: 505 78 121 414 350 212 121 414 78 121 ...
## $ election_tp : Factor w/ 5 levels "","G2016","O2016",..: 4 4 4 4 4 2 4 4 4 4 ...
## cand_nm contbr_nm
## Clinton, Hillary Rodham :126190 BUNSON, JAMIE : 282
## Sanders, Bernard :121555 TREIBEL, RANDY : 280
## Trump, Donald J. : 16222 BUCKLEY, MARK : 262
## Cruz, Rafael Edward 'Ted': 13357 WATSON, DONNA : 256
## Carson, Benjamin S. : 8085 BENACK, MARY ANN : 214
## Rubio, Marco : 2192 SOMERVILLE, DALENE: 191
## (Other) : 4716 (Other) :290832
## contbr_city contbr_zip contbr_employer
## SEATTLE : 82399 98053 : 286 : 42728
## VANCOUVER : 9555 981556413: 161 NONE : 28069
## OLYMPIA : 8965 98033 : 156 RETIRED : 26858
## BELLINGHAM: 8105 98004 : 152 SELF-EMPLOYED: 15757
## TACOMA : 8075 981882718: 145 NOT EMPLOYED : 15214
## BELLEVUE : 7942 991633631: 133 SELF : 8269
## (Other) :167276 (Other) :291284 (Other) :155422
## contbr_occupation contb_receipt_amt contb_receipt_dt
## RETIRED : 58036 Min. :-8432.99 31-MAR-16: 3443
## NOT EMPLOYED : 38766 1st Qu.: 15.00 29-FEB-16: 3323
## INFORMATION REQUESTED: 6361 Median : 27.00 31-MAY-16: 2576
## SOFTWARE ENGINEER : 5494 Mean : 82.95 30-MAR-16: 2549
## TEACHER : 5035 3rd Qu.: 60.00 30-APR-16: 2540
## ATTORNEY : 4753 Max. :10800.00 09-MAR-16: 2312
## (Other) :173872 (Other) :275574
## election_tp
## : 364
## G2016: 89381
## O2016: 109
## P2016:202462
## P2020: 1
##
##
The raw 2016 Presidential Campaign Finance contributor data for WA has 292317 observations and 19 variables; each observation is one donation transaction. I dropped 10 columns with Python beforehand, so the data imported into R has 9 variables and 292317 observations.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8432.99 15.00 27.00 82.95 60.00 10800.00
The summary of contb_receipt_amt shows that there are some negative values. These observations need to be removed.
## [1] 289340 9
Now the data set has 289340 observations.
Most donation amounts are small, and there are some extremely high outliers. Consequently, a log10 transform is necessary to get a more readable histogram.
##
## Shapiro-Wilk normality test
##
## data: log_normal_sample
## W = 0.97459, p-value < 2.2e-16
## 80%
## 100
We can see that 80% of the donations are $100 or less.
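The transform, normality test and percentile above can be sketched in base R as follows. This is a minimal sketch on synthetic log-normal amounts, not the original code: the vector `amt` and its parameters are assumptions, and only the name `log_normal_sample` is taken from the test output. Sampling is needed because `shapiro.test()` accepts at most 5000 values.

```r
# Synthetic stand-in for wa$contb_receipt_amt (right-skewed, assumed values).
set.seed(1)
amt <- round(10^rnorm(20000, mean = 1.5, sd = 0.5), 2)
amt <- amt[amt > 0]  # log10 is only defined for positive amounts

# log10 compresses the long right tail before plotting a histogram.
log_amt <- log10(amt)

# shapiro.test() accepts between 3 and 5000 values, so test a random sample.
log_normal_sample <- sample(log_amt, 5000)
sw <- shapiro.test(log_normal_sample)
print(sw)

# 80th percentile of the untransformed amounts.
q80 <- quantile(amt, 0.8)
print(q80)
```

With samples this large, the test tends to reject normality even for small deviations, so the W statistic is usually more informative than the p-value.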
Next, I add more variables to explore potentially interesting patterns.
##
## female male
## 148058 128403
Number of male and female contributors.
##
## female male
## 3 21
##
## democrat others republican
## 5 3 16
Number of male and female candidates, and number of candidates by party.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 1.0 14.0 403.5 133.0 81773.0
## [1] 717 2
## 95%
## 1423
There are 717 distinct city values because of misspellings, abbreviations and overly detailed names. One solution is to cross-validate against the zip code.
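One way the zip code cross-validation could work is sketched below. It assumes that the first five digits of contbr_zip identify the area and that the most frequent spelling within a zip code is the correct one; the rows, the `zip5` and `city_clean` columns, and the spellings are all made up for illustration.

```r
# Toy rows; the city spellings and zip codes below are illustrative only.
contrib <- data.frame(
  contbr_city = c("SEATTLE", "SEATLE", "SEATTLE",
                  "OLYMPIA", "OLYMPIA WA", "OLYMPIA"),
  contbr_zip  = c("981051234", "98105", "98105",
                  "98501", "985011111", "98501"),
  stringsAsFactors = FALSE
)

# Reduce ZIP+4 codes to their first five digits.
contrib$zip5 <- substr(contrib$contbr_zip, 1, 5)

# For each 5-digit zip, take the most frequent spelling as the canonical city.
canonical <- vapply(
  split(contrib$contbr_city, contrib$zip5),
  function(x) names(which.max(table(x))),
  character(1)
)
contrib$city_clean <- unname(canonical[contrib$zip5])
print(contrib)
```

This only helps where a zip code maps to a single city, so zip codes that span several cities would need a reference table instead.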
Not surprisingly, the bar that stands out is Seattle.
##
## Shapiro-Wilk normality test
##
## data: log_normal_sample
## W = 0.91347, p-value < 2.2e-16
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 4.00 35.09 11.00 57491.00
## [1] 8245 2
## 95%
## 59
There are 8260 different occupation levels in the raw data (the count table here has 8245 rows)! 95% of them appear in 59 or fewer contributions.
##
## Shapiro-Wilk normality test
##
## data: log_normal_sample
## W = 0.90732, p-value < 2.2e-16
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 16.0 93.0 447.9 538.2 7654.0
## [1] 646 2
There are 646 different zip codes.
##
## Shapiro-Wilk normality test
##
## data: log_normal_sample
## W = 0.97488, p-value = 4.424e-09
Moreover, I add demographic variables to the data set.
##
## Shapiro-Wilk normality test
##
## data: normal_sample
## W = 0.98395, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: normal_sample
## W = 0.94319, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: normal_sample
## W = 0.98174, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: normal_sample
## W = 0.94666, p-value < 2.2e-16
There are 289405 observations of 22 variables.
There are 9 original variables:
* cand_nm
* contbr_nm
* contbr_zip
* contbr_city
* contbr_employer
* contbr_occupation
* contb_receipt_amt
* contb_receipt_dt
* election_tp
There are 5 derived variables:
* party
* cand_first_nm
* cand_gender
* contbr_first_nm
* contbr_gender
There are 8 demographic variables coming from df_zip_demographics data set:
* total_population
* percent_white
* percent_black
* percent_asian
* percent_hispanic
* per_capita_income
* median_rent
* median_age
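Attaching zip-level demographics to each contribution could look roughly like this. It is a sketch under assumptions: `zip_demo` is a small hypothetical stand-in for df_zip_demographics with made-up values, the key column name `region` is assumed, and `zip5` is an assumed 5-digit zip column on the contribution side.

```r
# Toy contributions with a 5-digit zip key (values are illustrative).
wa_toy <- data.frame(
  zip5 = c("98105", "98105", "98501", "99999"),
  contb_receipt_amt = c(25, 50, 100, 10),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the zip-level demographics table; the real
# df_zip_demographics has more columns and different values.
zip_demo <- data.frame(
  region = c("98105", "98501"),
  per_capita_income = c(30000, 32000),
  median_age = c(24, 40),
  stringsAsFactors = FALSE
)

# Left join: keep every contribution, even when its zip has no demographics.
merged <- merge(wa_toy, zip_demo, by.x = "zip5", by.y = "region", all.x = TRUE)
print(merged)
```

Contributions whose zip code has no match get NA in the demographic columns, and `lm()` drops such rows by default.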
There are many categorical variables but no ordered factors.
The main features of interest in the data set are contb_receipt_amt, party, cand_nm, contbr_gender, total_population, per_capita_income, percent_white and median_age. I hope these variables can be used to build a model that predicts the contribution amount to each party.
I think contbr_zip, contbr_occupation, contb_receipt_dt, cand_gender, median_rent, percent_black, percent_asian, percent_hispanic and election_tp are all relevant to the contribution amount.
There are 5 derived variables:
* party
* cand_first_nm
* cand_gender
* contbr_first_nm
* contbr_gender
cand_nm -> party
cand_nm -> cand_first_nm -> cand_gender
contbr_nm -> contbr_first_nm -> contbr_gender
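The derivation chain above can be sketched as follows. The helper `first_name()` and both lookup tables are hypothetical: the real analysis needs a mapping for all 24 candidates and a proper name-to-gender source, neither of which is shown in this report. The contributor names are placeholders.

```r
# Toy rows in the "LAST, FIRST MIDDLE" format used by cand_nm and contbr_nm.
df <- data.frame(
  cand_nm   = c("Clinton, Hillary Rodham", "Trump, Donald J.", "Sanders, Bernard"),
  contbr_nm = c("DOE, JANE A", "ROE, JOHN", "POE, MARY ANN"),
  stringsAsFactors = FALSE
)

# The first token after the comma is taken as the first name.
first_name <- function(x) sub("^[^,]*,\\s*(\\S+).*$", "\\1", x)
df$cand_first_nm   <- first_name(df$cand_nm)
df$contbr_first_nm <- first_name(df$contbr_nm)

# Hypothetical lookup tables, for illustration only.
party_lookup <- c("Clinton, Hillary Rodham" = "democrat",
                  "Sanders, Bernard"        = "democrat",
                  "Trump, Donald J."        = "republican")
gender_lookup <- c(JANE = "female", MARY = "female", JOHN = "male")

df$party         <- unname(party_lookup[df$cand_nm])
df$contbr_gender <- unname(gender_lookup[df$contbr_first_nm])
print(df)
```

First names missing from the lookup become NA, which may be why the gender counts reported earlier do not add up to the full number of observations.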
Some contributions have amounts less than 0 or larger than 2700, which is illegal, so I simply remove these rows.
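Removing the out-of-range rows is a one-line filter. A minimal sketch on toy amounts, using the 0 and 2700 bounds stated above (the data frame names are assumptions):

```r
# Toy amounts covering both kinds of out-of-range values.
wa_toy <- data.frame(contb_receipt_amt = c(-50, 25, 2700, 2700.01, 10800))

# Keep only rows whose amount lies in [0, 2700].
wa_clean <- subset(wa_toy, contb_receipt_amt >= 0 & contb_receipt_amt <= 2700)
print(dim(wa_clean))
```

Amounts of exactly 0 pass this filter, so they would need separate handling before a log10 transform.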
The three largest groups of contributors by occupation are RETIRED, NOT EMPLOYED and INFORMATION REQUESTED.
The election type has one blank level; I have no idea what it represents.
There are 717 distinct city values because of misspellings, abbreviations and overly detailed names. One solution is to cross-validate against the zip code.
There are 8260 different occupation levels! 95% of them appear in 59 or fewer contributions.
##
## Two-Step Estimates
##
## Correlations/Type of Correlation:
## contb_receipt_amt election_tp party contbr_gender
## contb_receipt_amt 1 Polyserial Polyserial Polyserial
## election_tp -0.1045 1 Polychoric Polychoric
## party 0.1345 0.1811 1 Polychoric
## contbr_gender 0.05219 0.1719 0.2495 1
## cand_gender -0.08513 0.7649 0.7235 0.3392
## total_population -0.01675 -0.02927 -0.08306 -0.004154
## percent_white -0.001675 0.04469 0.08542 -0.03774
## percent_black -0.00345 -0.03836 -0.1544 0.03451
## percent_asian 0.04366 -0.09241 -0.1807 0.02302
## percent_hispanic -0.02905 0.02757 0.146 0.02988
## per_capita_income 0.1326 -0.1275 -0.2163 0.006958
## median_rent 0.09303 -0.1078 -0.1644 -0.001523
## median_age -0.0006057 0.02908 0.05621 -0.04872
## time -0.01964 -0.9579 -0.2709 -0.1579
## cand_gender total_population percent_white percent_black
## contb_receipt_amt Polyserial Pearson Pearson Pearson
## election_tp Polychoric Polyserial Polyserial Polyserial
## party Polychoric Polyserial Polyserial Polyserial
## contbr_gender Polychoric Polyserial Polyserial Polyserial
## cand_gender 1 Polyserial Polyserial Polyserial
## total_population -0.04162 1 Pearson Pearson
## percent_white 0.06101 -0.3161 1 Pearson
## percent_black -0.08147 0.1389 -0.7777 1
## percent_asian -0.1408 0.31 -0.7548 0.56
## percent_hispanic 0.08144 0.1994 -0.5314 0.1567
## per_capita_income -0.2141 -0.05897 0.1628 -0.1208
## median_rent -0.1678 0.1782 -0.03041 -0.07032
## median_age 0.03728 -0.523 0.4226 -0.2529
## time -0.7313 0.02878 -0.03094 0.03218
## percent_asian percent_hispanic per_capita_income
## contb_receipt_amt Pearson Pearson Pearson
## election_tp Polyserial Polyserial Polyserial
## party Polyserial Polyserial Polyserial
## contbr_gender Polyserial Polyserial Polyserial
## cand_gender Polyserial Polyserial Polyserial
## total_population Pearson Pearson Pearson
## percent_white Pearson Pearson Pearson
## percent_black Pearson Pearson Pearson
## percent_asian 1 Pearson Pearson
## percent_hispanic -0.00376 1 Pearson
## per_capita_income 0.2319 -0.409 1
## median_rent 0.4367 -0.3276 0.7616
## median_age -0.2799 -0.3175 0.1243
## time 0.06582 -0.02828 0.08421
## median_rent median_age time
## contb_receipt_amt Pearson Pearson Pearson
## election_tp Polyserial Polyserial Polyserial
## party Polyserial Polyserial Polyserial
## contbr_gender Polyserial Polyserial Polyserial
## cand_gender Polyserial Polyserial Polyserial
## total_population Pearson Pearson Pearson
## percent_white Pearson Pearson Pearson
## percent_black Pearson Pearson Pearson
## percent_asian Pearson Pearson Pearson
## percent_hispanic Pearson Pearson Pearson
## per_capita_income Pearson Pearson Pearson
## median_rent 1 Pearson Pearson
## median_age -0.08682 1 Pearson
## time 0.06859 -0.02677 1
contb_receipt_amt and party have 0.135 correlation
contb_receipt_amt and election_tp have -0.105 correlation
contb_receipt_amt and per_capita_income have 0.133 correlation
cand_gender and contbr_gender have 0.339 correlation
cand_gender and per_capita_income have -0.214 correlation
party and per_capita_income have -0.216 correlation
party and percent_black have -0.154 correlation
party and percent_asian have -0.181 correlation
party and percent_hispanic have 0.146 correlation
total_population and median_age have -0.523 correlation
total_population and percent_white have -0.316 correlation
total_population and percent_black have 0.139 correlation
total_population and percent_asian have 0.31 correlation
total_population and percent_hispanic have 0.199 correlation
percent_white and percent_black have -0.777 correlation
percent_white and percent_asian have -0.755 correlation
percent_white and percent_hispanic have -0.531 correlation
percent_asian and per_capita_income have 0.232 correlation
percent_asian and median_rent have 0.437 correlation
percent_asian and median_age have -0.28 correlation
percent_hispanic and median_rent have -0.328 correlation
percent_hispanic and median_age have -0.318 correlation
percent_hispanic and per_capita_income have -0.409 correlation
per_capita_income and median_rent have 0.762 correlation
Some interesting demographic relationships.
It shows that contributors from areas with a per capita income of around 70000 make the most contributions. The peak in median contribution at low income seems very strange.
It shows that contributors from areas with a median age of about 40 make the most contributions, but those from areas with a median age of about 30 make the highest median contribution. There are also two other peaks at 25 and 60 in the median amount plot.
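The per-age totals and medians behind such plots can be computed with `tapply()`. This is a sketch on made-up rows, chosen so that the highest total and the highest median fall at different ages; it is not the original plotting code, and the `by_age` frame is an assumed name.

```r
# Toy rows: median_age of the contributor's zip and the amount given.
wa_toy <- data.frame(
  median_age        = c(30, 30, 40, 40, 40, 60),
  contb_receipt_amt = c(100, 200, 20, 30, 400, 50)
)

# Total and median contribution for each distinct median_age value.
total <- tapply(wa_toy$contb_receipt_amt, wa_toy$median_age, sum)
med   <- tapply(wa_toy$contb_receipt_amt, wa_toy$median_age, median)

by_age <- data.frame(
  median_age = as.numeric(names(total)),
  total      = as.vector(total),
  median     = as.vector(med)
)
print(by_age)
```

In this toy data, age 40 has the largest total because it has the most rows, while age 30 has the largest median, which shows how the two summaries can disagree.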
In addition, I group contb_receipt_amt by population structure.
This plot does not make sense.
People contribute more and more as election day and other key dates approach. I wonder what causes the valley.
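To plot contributions over time, contb_receipt_dt first has to become a real date. The sketch below parses the "31-MAR-16" format without depending on the locale; the helper `parse_receipt_date()` is hypothetical, and it assumes that every two-digit year belongs to the 2000s.

```r
# Hypothetical helper: parse "DD-MON-YY" strings such as "31-MAR-16".
# Assumes English month abbreviations and years in the 2000s.
parse_receipt_date <- function(x) {
  parts <- strsplit(as.character(x), "-", fixed = TRUE)
  day   <- vapply(parts, function(p) p[1], character(1))
  mon   <- vapply(parts, function(p) p[2], character(1))
  year  <- vapply(parts, function(p) p[3], character(1))
  month <- match(mon, toupper(month.abb))  # "MAR" -> 3
  as.Date(sprintf("20%s-%02d-%s", year, month, day))
}

# Toy rows; the amounts are illustrative.
toy <- data.frame(
  date = parse_receipt_date(c("31-MAR-16", "29-FEB-16", "01-APR-15", "31-MAR-16")),
  amt  = c(10, 20, 30, 40)
)

# Total contribution per day, ordered by date.
daily <- aggregate(amt ~ date, data = toy, FUN = sum)
print(daily)
```

Once the dates are parsed, the valley can be located with something like `daily$date[which.min(daily$amt)]` over the period of interest.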
contb_receipt_amt and party have 0.135 correlation. contb_receipt_amt and election_tp have -0.105 correlation.
contb_receipt_amt and per_capita_income have 0.133 correlation.
Democrats’ total contribution is larger than Republicans’, but Republicans’ mean contribution is larger than Democrats’. Contributors from areas with a per capita income of around 70000 make the most contributions. Men contribute more than women in both sum and mean, but not by much. There are some high contribution outliers at low income, which seems very strange.
Contributors from areas with a median age of about 40 make the most contributions, but those from areas with a median age of about 30 make the highest median contribution. There is one strange peak at about 22 in the sum of contributions.
There are also two other peaks at 25 and 60 in median contribution. People contribute more and more as election day and other key dates approach.
cand_gender and contbr_gender have 0.339 correlation
cand_gender and per_capita_income have -0.214 correlation
party and per_capita_income have -0.216 correlation
party and percent_black have -0.154 correlation
party and percent_asian have -0.181 correlation
party and percent_hispanic have 0.146 correlation
total_population and median_age have -0.523 correlation
total_population and percent_white have -0.316 correlation
total_population and percent_black have 0.139 correlation
total_population and percent_asian have 0.31 correlation
total_population and percent_hispanic have 0.199 correlation
percent_white and percent_black have -0.777 correlation
percent_white and percent_asian have -0.755 correlation
percent_white and percent_hispanic have -0.531 correlation
percent_asian and per_capita_income have 0.232 correlation
percent_asian and median_rent have 0.437 correlation
percent_asian and median_age have -0.28 correlation
percent_hispanic and median_rent have -0.328 correlation
percent_hispanic and median_age have -0.318 correlation
percent_hispanic and per_capita_income have -0.409 correlation
per_capita_income and median_rent have 0.762 correlation
percent_white and percent_black have a -0.777 correlation, which is even stronger in magnitude than the 0.762 correlation between per_capita_income and median_rent.
Both Trump and Hillary are more popular with male contributors than with female contributors.
Moreover, I try to build a linear model to predict the contribution amount, but it does not fit well: R-squared is at most 0.033 across the five models.
##
## Calls:
## m1: lm(formula = contb_receipt_amt ~ per_capita_income, data = wa)
## m2: lm(formula = contb_receipt_amt ~ per_capita_income + contbr_gender,
## data = wa)
## m3: lm(formula = contb_receipt_amt ~ per_capita_income + contbr_gender +
## party, data = wa)
## m4: lm(formula = contb_receipt_amt ~ per_capita_income + contbr_gender +
## party + median_age, data = wa)
## m5: lm(formula = contb_receipt_amt ~ per_capita_income + contbr_gender +
## party + median_age + percent_white, data = wa)
##
## =====================================================================================================
## m1 m2 m3 m4 m5
## -----------------------------------------------------------------------------------------------------
## (Intercept) -9.646*** -18.461*** -36.857*** -5.524 10.220**
## (1.403) (1.489) (1.517) (3.112) (3.349)
## per_capita_income 0.003*** 0.003*** 0.003*** 0.003*** 0.003***
## (0.000) (0.000) (0.000) (0.000) (0.000)
## contbr_gendermale 20.036*** 12.929*** 12.442*** 12.146***
## (0.929) (0.931) (0.932) (0.932)
## partyothers 174.332*** 175.027*** 175.156***
## (6.814) (6.813) (6.811)
## partyrepublican 71.136*** 72.050*** 73.165***
## (1.329) (1.331) (1.334)
## median_age -0.848*** -0.422***
## (0.074) (0.081)
## percent_white -0.464***
## (0.037)
## -----------------------------------------------------------------------------------------------------
## R-squared 0.018 0.019 0.031 0.032 0.033
## adj. R-squared 0.018 0.019 0.031 0.032 0.033
## sigma 242.979 241.651 240.144 240.086 240.015
## F 5113.117 2676.403 2213.295 1798.083 1526.156
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1970336.620 -1881166.331 -1879460.552 -1879394.094 -1879313.501
## Deviance 16829704435.971 15905510725.251 15707535943.265 15699872892.314 15690584822.102
## AIC 3940679.240 3762340.662 3758933.104 3758802.189 3758643.001
## BIC 3940710.922 3762382.722 3758996.193 3758875.793 3758727.121
## N 285064 272379 272379 272379 272379
## =====================================================================================================
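A nested model comparison of this kind can be reproduced in base R. The sketch below uses synthetic data with a deliberately weak signal and heavy noise to mimic the low R-squared; the coefficients in the data generation are assumptions and not estimates from the real data set.

```r
# Synthetic data: predictors with the same names as in the report.
set.seed(42)
n <- 2000
toy <- data.frame(
  per_capita_income = runif(n, 15000, 80000),
  contbr_gender = factor(sample(c("female", "male"), n, replace = TRUE)),
  party = factor(sample(c("democrat", "others", "republican"), n, replace = TRUE))
)

# Assumed response: weak effects plus noise with a large standard deviation.
toy$contb_receipt_amt <- 0.003 * toy$per_capita_income +
  12 * (toy$contbr_gender == "male") + rnorm(n, sd = 240)

# Build each model by adding one term to the previous one.
m1 <- lm(contb_receipt_amt ~ per_capita_income, data = toy)
m2 <- update(m1, . ~ . + contbr_gender)
m3 <- update(m2, . ~ . + party)

# Compare explained variance across the nested models.
r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
             function(m) summary(m)$r.squared)
print(round(r2, 3))
```

In the table above, the coefficients are statistically significant mainly because N is large; an R-squared of about 0.03 means the models explain roughly 3% of the variance in contribution amount.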
That is very interesting! There is a strong relationship between contributor occupation and the candidate’s background.
Hillary’s contributions dropped into a valley after the email controversy and then recovered. After a peak around the first week of October, contributions slumped, and I cannot figure out why. Trump’s contributions are steadily low compared with Hillary’s.
The trend is almost the same regardless of per capita income!
Furthermore, I subset the map data for WA.
If the Democratic total contribution in a zip code area is larger than the Republican total, the whole area is filled with blue, otherwise red. The map shows that Democrats and Republicans seem evenly matched.
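The blue or red fill rule can be sketched as follows. The rows are made up, the `zip5` column is an assumed 5-digit zip key, and ties are coloured red here because the rule above only gives blue when the Democratic total is strictly larger.

```r
# Toy contributions by zip and party (values are illustrative).
toy <- data.frame(
  zip5  = c("98105", "98105", "98105", "99163", "99163"),
  party = c("democrat", "republican", "democrat", "republican", "democrat"),
  contb_receipt_amt = c(50, 100, 75, 200, 20),
  stringsAsFactors = FALSE
)

# Only the two major parties take part in the comparison.
main <- subset(toy, party %in% c("democrat", "republican"))

# Total contribution per zip (rows) and party (columns).
totals <- xtabs(contb_receipt_amt ~ zip5 + party, data = main)

# Blue where the Democratic total is larger, red otherwise.
fill <- ifelse(totals[, "democrat"] > totals[, "republican"], "blue", "red")
print(fill)
```

A zip code with no contributions never appears in `totals`, so it stays blank on the map rather than being coloured red.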
The map shows that Republicans receive more contributions by zip code.
Hillary received many large contributions as the Democratic nomination date approached. Hillary’s contributions dropped into a valley after the email controversy and then recovered. After a peak around the first week of October, contributions slumped, and I cannot figure out why. Trump’s contributions are steadily low compared with Hillary’s. Though Hillary finally won in Washington State, it is hard to say that there is a strong relationship between contributions and votes.
Both Trump and Hillary are more popular with male contributors than with female contributors. Female contributors give less than male contributors.
##
## Call:
## lm(formula = contb_receipt_amt ~ per_capita_income + contbr_gender +
## party + median_age + percent_white, data = wa)
##
## Coefficients:
## (Intercept) per_capita_income contbr_gendermale
## 10.219845 0.002914 12.145989
## partyothers partyrepublican median_age
## 175.155750 73.164645 -0.422120
## percent_white
## -0.464180
After faceting by age, the plot shows that Hillary is more popular with female contributors than with male contributors.
Sanders is popular with all age groups, both male and female.
It seems that Trump has a good contribution distribution.
After working all night, I finally added annotations to the different facets at different coordinates.
Hillary’s contributions dropped into a valley after the email controversy and then recovered. After a peak around the first week of October, contributions slumped, and I cannot figure out why. Trump’s contributions are steadily low compared with Hillary’s. Though Hillary finally won in Washington State, it is hard to say that there is a strong relationship between contributions and votes.
Though Hillary has more contributions, Trump’s contributions seem to be spread more broadly. But there are too many blank districts, so it is hard to draw a conclusion.
This is the most exciting plot in the whole analysis. The plot shows a strong and interesting relationship between contributor occupation and the candidate’s background.
For Hillary, the top occupation is attorney, and the 11th is lawyer; as we all know, Hillary once belonged to this group. The second is homemaker, which suggests that Hillary is popular with women. We can also see that Hillary is popular with the not employed and with the education sector.
For Trump, the top occupation is self-employed. That is amazing. Perhaps Trump has some traits that the self-employed appreciate, such as courage. The top 10 also include CEO, president, business owner and owner. What is more, Trump has a group of donors with occupations such as contractor, project manager and real estate.
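A ranking like this can be produced by counting occupations within each candidate. The sketch below uses a handful of made-up rows, and the helper `top_n_occupations()` is hypothetical rather than the code behind the plot.

```r
# Toy rows; occupations and counts are illustrative only.
toy <- data.frame(
  cand_nm = c(rep("Clinton, Hillary Rodham", 5), rep("Trump, Donald J.", 4)),
  contbr_occupation = c("ATTORNEY", "ATTORNEY", "HOMEMAKER", "TEACHER", "ATTORNEY",
                        "SELF-EMPLOYED", "CEO", "SELF-EMPLOYED", "FARMER"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the n most frequent occupations for each candidate.
top_n_occupations <- function(df, n = 2) {
  lapply(split(df$contbr_occupation, df$cand_nm),
         function(occ) head(sort(table(occ), decreasing = TRUE), n))
}

top <- top_n_occupations(toy)
print(top)
```

Because ATTORNEY and LAWYER are counted as separate levels, merging such synonyms before counting would change the ranking.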
Hillary is supported by nurses while Trump is supported by farmers.
# Reflection
The contribution map shows that many districts have no contributions at all. I don’t know whether this is due to data quality or reflects reality. Maybe Washington State is not a good data set for analyzing an election campaign.
It is hard to find a strong relationship between the election result and contributions, since Trump funded his campaign himself.
There is a really strong relationship between contribution and date.
The strongest relationship is between candidates and their donors’ occupations, which shows up in a bar plot but not in a correlation.
Since contributions are capped and influenced by many factors, I think building a model to predict contribution is nearly impossible.
The data quality is not good enough, and some values are missing, because of inconsistent spellings, manual errors and other unknown reasons.
It would be helpful to import voting, demographic and geographic data to cross-validate and supplement the election contribution data set.
I think a larger data set, such as the one for the whole USA, would reveal more interesting relationships.